This week’s assignment was to evaluate the performance of non-linear regression methods in exploring the relationships between variables relating to the personal history of male workers in the Mid-Atlantic region of the US and their wages. To compare the methods’ ability to describe relationships in the data, the data were split into training and testing datasets using a 75:25 ratio, and the RMSE was calculated for each non-linear method used. ANOVA and cross-validation were used for feature selection and parameter tuning. Polynomial regression and smoothing splines were applied to continuous variables, linear terms were used for categorical variables, and generalized additive models (GAMs) were also fitted. All methods produced similar RMSE values, although the cubic polynomial model including all variables produced the lowest RMSE on the test data and the smallest discrepancy from the training results.
The Wage dataset from the ISLR package contains personal information on 3,000 male workers in the Mid-Atlantic region. The recorded variables are described below.
tibble(
Variable = names(Wage),
Description = c(
"Year that wage information was recorded",
"Age of worker",
"A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status",
"A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race",
"A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level",
"Region of the country (mid-atlantic only)",
"A factor with levels 1. Industrial and 2. Information indicating type of job",
"A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker",
"A factor with levels 1. Yes and 2. No indicating whether worker has health insurance",
"Log of worker's wage",
"Worker's raw wage"
)
) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)
| Variable | Description |
|---|---|
| year | Year that wage information was recorded |
| age | Age of worker |
| maritl | A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status |
| race | A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race |
| education | A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level |
| region | Region of the country (mid-atlantic only) |
| jobclass | A factor with levels 1. Industrial and 2. Information indicating type of job |
| health | A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker |
| health_ins | A factor with levels 1. Yes and 2. No indicating whether worker has health insurance |
| logwage | Log of worker's wage |
| wage | Worker's raw wage |
Exploratory data analysis was first conducted to identify potential missing values, non-normality in the predictor and response variables, and correlations between the predictors and the response variable (wage). As no individuals were recorded from a region outside the Mid-Atlantic, the region variable was dropped from the dataset. After the necessary transformations were made, training and testing datasets were created using a 75:25 split stratified on the response variable, minimizing the chance of unequal distributions of wage between the training and testing sets.
The following non-linear methods were applied to the continuous variables within the training data: polynomial regression, GAMs with smoothing splines, and GAMs with local regression. Starting with polynomial regression, cubic functions were fitted over combinations of the predictor variables, and ANOVA was used to determine whether each subsequent addition improved model fit. Cross-validation on RMSE was then used to confirm the optimal polynomial degree. The same techniques were used to evaluate fit among the GAMs, with cross-validation used to optimize the degrees of freedom in the smoothing splines and the span in the local regression.
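As a quick illustration of the three classes of fit described above, the sketch below fits each to simulated data. This is not the report's actual models: it uses base-R analogues (`smooth.spline()` and `loess()` stand in for the `s()` and `lo()` terms of the gam package), and all names and the simulated relationship are illustrative only.

```r
# Illustrative sketch on simulated data: a cubic polynomial, a smoothing
# spline, and a local regression fit to the same non-linear relationship.
set.seed(1)
sim <- data.frame(age = seq(18, 80, length.out = 300))
sim$logwage <- 4 + 0.03 * sim$age - 0.0003 * sim$age^2 + rnorm(300, sd = 0.3)

fit_poly   <- lm(logwage ~ poly(age, 3), data = sim)        # cubic polynomial
fit_smooth <- smooth.spline(sim$age, sim$logwage, df = 4)   # smoothing spline
fit_local  <- loess(logwage ~ age, data = sim, span = 0.5)  # local regression

# Predict each fit on a common grid to compare the fitted curves
grid <- data.frame(age = 18:80)
preds <- cbind(
  poly   = predict(fit_poly, newdata = grid),
  smooth = predict(fit_smooth, x = grid$age)$y,
  local  = predict(fit_local, newdata = grid)
)
head(round(preds, 3))
```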
From the EDA, non-normality could be observed in the wage variable; however, this was largely resolved through log transformation (see figures below).
ggplot(Wage) +
  geom_density(aes(x = wage), fill = "pink") +
  geom_vline(xintercept = mean(Wage$wage), col = "dodgerblue", size = 1) +
  geom_vline(xintercept = median(Wage$wage), col = "orangered4", size = 1) +
  annotate("text", label = "mean", x = mean(Wage$wage) + 10, y = 0.004) +
  annotate("text", label = "median", x = median(Wage$wage) - 10, y = 0.004) +
  ggtitle("Density plot of wage")
ggplot(Wage) +
  geom_density(aes(x = logwage), fill = "cornsilk") +
  geom_vline(xintercept = mean(Wage$logwage), col = "dodgerblue", size = 1) +
  geom_vline(xintercept = median(Wage$logwage), col = "orangered4", size = 1) +
  annotate("text", label = "mean", x = mean(Wage$logwage) + 0.1, y = 0.004) +
  annotate("text", label = "median", x = median(Wage$logwage) - 0.1, y = 0.004) +
  ggtitle("Density plot of logwage")
Plotting combinations of the predictor variables against the response (logwage), a number of general observations could be made: older men generally earned more than younger men; married men typically earned more than unmarried men; Asian men typically earned more than men of other races; there was a strong positive trend of increasing wage with increasing education level; men in information jobs earned more than men in industrial jobs; men in very good or better health earned more; and men without health insurance earned less. These patterns can be seen in the figures below.
ggplot(Wage, aes(x = age, y = logwage)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ poly(x, 3)") +
ggtitle(label = "Log Wage vs Age", subtitle = "Cubic polynomial fit")
for (i in 3:9){
  p <- ggplot(Wage) +
    geom_boxplot(aes(x = .data[[names(Wage)[i]]], y = logwage, fill = .data[[names(Wage)[i]]])) +
    ggtitle(label = paste0("Log Wage vs ", names(Wage)[i])) +
    xlab(label = names(Wage)[i])
  print(p)
}
After EDA, the non-linear methods were fitted to the data and evaluated using ANOVA to test model fit after introducing additional predictors, and cross-validation to optimize each method. The resulting RMSE values on both the training and testing datasets are shown in the table below. The best-fitting model on the test data proved to be the cubic polynomial using all of the predictor variables. As age was the only continuous variable, the remaining predictors were input directly into the lm function, which encodes the factors as dummy variables during fitting. Whilst the GAM with local regression gave a marginally lower training RMSE at span = 0.2, it did not produce a smooth function for logwage vs age, resulting in a higher test RMSE. Setting span = 0.5 remedied this issue, but both training and test RMSE were higher than for the cubic model. The trends and correlations observed in the EDA remained significant in the final combined model summary, with two additions: year was a significant predictor (a slight increase in logwage over time), and whilst Asian men did not earn significantly more than men of other races, Black men did earn slightly less (borderline statistical significance, p = 0.053).
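The RMSE values in the table are each computed the same way: the root of the mean squared difference between observed and predicted logwage. A small helper function makes the computation explicit (`rmse` is a hypothetical helper for illustration, not part of the original analysis code):

```r
# RMSE as used throughout the report: sqrt of the mean squared residual
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy check: predictions off by a constant 0.1 give an RMSE of exactly 0.1
rmse(c(4.3, 4.7, 5.0), c(4.4, 4.8, 5.1))  # → 0.1
```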
tibble(
Method = c(
"Cubic",
"Cubic",
"Cubic",
"Cubic",
"Step",
"Step",
"GAM - Smoothing (df = 4)",
"GAM - Smoothing (df = 4)",
"GAM - Local (span = 0.5)",
"GAM - Local (span = 0.5)"
),
`Variables included` = c(
"Age",
"Age",
"All",
"All",
"Year",
"Year",
"All",
"All",
"All",
"All"
),
Data = c(
"Training",
"Test",
"Training",
"Test",
"Training",
"Test",
"Training",
"Test",
"Training",
"Test"
),
RMSE = c(
"0.3283",
"0.3395",
"0.2728",
"0.2797",
"0.3479",
"0.3578",
"0.2727",
"0.2798",
"0.2724",
"0.2801"
)
) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F) %>%
row_spec(row = 3:4, bold = TRUE)
| Method | Variables included | Data | RMSE |
|---|---|---|---|
| Cubic | Age | Training | 0.3283 |
| Cubic | Age | Test | 0.3395 |
| Cubic | All | Training | 0.2728 |
| Cubic | All | Test | 0.2797 |
| Step | Year | Training | 0.3479 |
| Step | Year | Test | 0.3578 |
| GAM - Smoothing (df = 4) | All | Training | 0.2727 |
| GAM - Smoothing (df = 4) | All | Test | 0.2798 |
| GAM - Local (span = 0.5) | All | Training | 0.2724 |
| GAM - Local (span = 0.5) | All | Test | 0.2801 |
In conclusion, examining the Wage dataset provided the opportunity to apply non-linear methods to ascertain relationships between personal information and the wages earned by males working in the Mid-Atlantic. All methods tested provided similar results when assessed on training and test datasets using RMSE, but the cubic polynomial function resulted in the smallest discrepancy between training and test results, i.e. the lowest variance. All variables were included in the final model, as each improved fit according to ANOVA. Wage was found to be positively associated with age, being married (other marital statuses had no significant effect), information-type jobs, good health, and having health insurance; wages increased slightly over time, and Black men earned slightly less than men of other races.
library(ISLR)
library(tidyverse)
library(splines)
library(gam)
library(akima)
library(kableExtra)
library(hrbrthemes)
library(janitor)
library(rsample)
library(corrplot)
library(caret)
library(caretEnsemble)
library(PerformanceAnalytics)
RNGkind(sample.kind = "Rounding")
set.seed(1)
theme_set(theme_ipsum())
data(Wage)
dim(Wage)
[1] 3000 11
class(Wage)
[1] "data.frame"
names(Wage)
[1] "year" "age" "maritl" "race" "education" "region" "jobclass" "health" "health_ins"
[10] "logwage" "wage"
glimpse(Wage)
Rows: 3,000
Columns: 11
$ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004, 2007, 2007, 2005, 2003, 2009, 2009, 2003, 2…
$ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54, 51, 37, 50, 56, 37, 38, 40, 75, 40, 38, 49,…
$ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married, 4. Divorced, 2. Married, 2. Married, 1. Nev…
$ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White, 4. Other, 3. Asian, 2. Black, 1. White, 1. …
$ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College Grad, 2. HS Grad, 4. College Grad, 3. Some C…
$ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic, 2. …
$ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Information, 2. Information, 2. Information, 1. Indust…
$ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 2. >=Very Good, 1. <=G…
$ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 2. No, 1. Yes, 1.…
$ logwage <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098, 5.133021, 4.716003, 4.778151, 4.857332, 4.7…
$ wage <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.11574, 169.52854, 111.72085, 118.88436, 128.680…
summary(Wage)
year age maritl race education
Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480 1. < HS Grad :268
1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293 2. HS Grad :971
Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190 3. Some College :650
Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37 4. College Grad :685
3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55 5. Advanced Degree:426
Max. :2009 Max. :80.00
region jobclass health health_ins logwage wage
2. Middle Atlantic :3000 1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
1. New England : 0 2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
3. East North Central: 0 Median :4.653 Median :104.92
4. West North Central: 0 Mean :4.654 Mean :111.70
5. South Atlantic : 0 3rd Qu.:4.857 3rd Qu.:128.68
6. East South Central: 0 Max. :5.763 Max. :318.34
(Other) : 0
anyNA(Wage)
[1] FALSE
ggplot(Wage) +
  geom_density(aes(x = wage), fill = "pink") +
  geom_vline(xintercept = mean(Wage$wage), col = "dodgerblue", size = 1) +
  geom_vline(xintercept = median(Wage$wage), col = "orangered4", size = 1) +
  annotate("text", label = "mean", x = mean(Wage$wage) + 10, y = 0.004) +
  annotate("text", label = "median", x = median(Wage$wage) - 10, y = 0.004) +
  ggtitle("Density plot of wage")
ggplot(Wage) +
  geom_density(aes(x = logwage), fill = "cornsilk") +
  geom_vline(xintercept = mean(Wage$logwage), col = "dodgerblue", size = 1) +
  geom_vline(xintercept = median(Wage$logwage), col = "orangered4", size = 1) +
  annotate("text", label = "mean", x = mean(Wage$logwage) + 0.1, y = 0.004) +
  annotate("text", label = "median", x = median(Wage$logwage) - 0.1, y = 0.004) +
  ggtitle("Density plot of logwage")
corrplot(cor(Wage[, -c(3:9)]), method = "square")
chart.Correlation(Wage[, -c(3:9)])
for (i in 1:5){
  p <- ggplot(Wage) +
    geom_bar(aes(x = .data[[names(Wage)[i]]])) +
    xlab(label = names(Wage)[i])
  print(p)
}
tabyl(Wage$maritl)
Wage$maritl n percent
1. Never Married 648 0.216000000
2. Married 2074 0.691333333
3. Widowed 19 0.006333333
4. Divorced 204 0.068000000
5. Separated 55 0.018333333
tabyl(Wage$race)
Wage$race n percent
1. White 2480 0.82666667
2. Black 293 0.09766667
3. Asian 190 0.06333333
4. Other 37 0.01233333
tabyl(Wage$education)
Wage$education n percent
1. < HS Grad 268 0.08933333
2. HS Grad 971 0.32366667
3. Some College 650 0.21666667
4. College Grad 685 0.22833333
5. Advanced Degree 426 0.14200000
tabyl(Wage$region)
Wage$region n percent
1. New England 0 0
2. Middle Atlantic 3000 1
3. East North Central 0 0
4. West North Central 0 0
5. South Atlantic 0 0
6. East South Central 0 0
7. West South Central 0 0
8. Mountain 0 0
9. Pacific 0 0
tabyl(Wage$jobclass)
Wage$jobclass n percent
1. Industrial 1544 0.5146667
2. Information 1456 0.4853333
tabyl(Wage$health)
Wage$health n percent
1. <=Good 858 0.286
2. >=Very Good 2142 0.714
tabyl(Wage$health_ins)
Wage$health_ins n percent
1. Yes 2083 0.6943333
2. No 917 0.3056667
clean_wage <- Wage %>%
  select(-region)
ggplot(Wage) +
  geom_boxplot(aes(x = factor(year), y = logwage, fill = factor(year)))
ggplot(Wage) +
geom_point(aes(x = age, y = logwage))
for (i in 3:9){
  p <- ggplot(Wage) +
    geom_boxplot(aes(x = .data[[names(Wage)[i]]], y = logwage, fill = .data[[names(Wage)[i]]])) +
    xlab(label = names(Wage)[i])
  print(p)
}
General increasing trend in wage by education, jobclass, and health; general decreasing trend for workers without health insurance (health_ins).
set.seed(1)
wage_split <- initial_split(clean_wage, strata = wage, prop = 0.75)
train_data <- training(wage_split)
test_data <- testing(wage_split)
poly_1 <- lm(logwage ~ poly(age, 3), data = train_data)
poly_2 <- lm(logwage ~ poly(age, 4), data = train_data)
poly_3 <- lm(logwage ~ poly(age, 3) + education, data = train_data)
poly_4 <- lm(logwage ~ poly(age, 4) + education, data = train_data)
anova(poly_1, poly_2, poly_3, poly_4)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 4)
Model 3: logwage ~ poly(age, 3) + education
Model 4: logwage ~ poly(age, 4) + education
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2246 242.02 1 0.607 7.1260 0.007652 **
3 2243 191.07 3 50.949 199.4548 < 2.2e-16 ***
4 2242 190.90 1 0.176 2.0725 0.150114
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_2, poly_4)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 4)
Model 2: logwage ~ poly(age, 4) + education
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2246 242.02
2 2242 190.90 4 51.125 150.11 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ggplot(train_data, aes(x = age, y = logwage)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ poly(x, 3)") +
geom_smooth(method = "lm", formula = "y ~ poly(x, 4)", color = "red")
There doesn’t seem to be a major improvement past the cubic polynomial once other predictors are added.
poly_3 <- lm(logwage ~ poly(age, 3) + education, data = train_data)
poly_5 <- lm(logwage ~ poly(age, 3) + education + jobclass, data = train_data)
poly_6 <- lm(logwage ~ poly(age, 3) + education + jobclass + health, data = train_data)
poly_7 <- lm(logwage ~ poly(age, 3) + education + jobclass + health + health_ins, data = train_data)
poly_8 <- lm(logwage ~ poly(age, 3) + education + jobclass + health + health_ins + maritl, data = train_data)
poly_9 <- lm(logwage ~ poly(age, 3) + education + jobclass + health + health_ins + maritl + year, data = train_data)
poly_10 <- lm(logwage ~ poly(age, 3) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
anova(poly_3, poly_5, poly_6, poly_7, poly_8, poly_9, poly_10)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3) + education
Model 2: logwage ~ poly(age, 3) + education + jobclass
Model 3: logwage ~ poly(age, 3) + education + jobclass + health
Model 4: logwage ~ poly(age, 3) + education + jobclass + health + health_ins
Model 5: logwage ~ poly(age, 3) + education + jobclass + health + health_ins +
maritl
Model 6: logwage ~ poly(age, 3) + education + jobclass + health + health_ins +
maritl + year
Model 7: logwage ~ poly(age, 3) + education + jobclass + health + health_ins +
maritl + year + race
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2243 191.07
2 2242 190.72 1 0.3519 4.6884 0.03047 *
3 2241 188.52 1 2.2019 29.3337 6.747e-08 ***
4 2240 175.50 1 13.0171 173.4142 < 2.2e-16 ***
5 2236 169.43 4 6.0709 20.2194 2.242e-16 ***
6 2235 167.98 1 1.4517 19.3394 1.146e-05 ***
7 2232 167.54 3 0.4400 1.9541 0.11884
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
poly_1 <- lm(logwage ~ poly(age, 3), data = train_data)
poly_3 <- lm(logwage ~ poly(age, 3) + education, data = train_data)
poly_8 <- lm(logwage ~ poly(age, 3) + jobclass, data = train_data)
poly_9 <- lm(logwage ~ poly(age, 3) + health, data = train_data)
poly_10 <- lm(logwage ~ poly(age, 3) + health_ins, data = train_data)
poly_11 <- lm(logwage ~ poly(age, 3) + maritl, data = train_data)
poly_12 <- lm(logwage ~ poly(age, 3) + year, data = train_data)
poly_13 <- lm(logwage ~ poly(age, 3) + race, data = train_data)
anova(poly_1, poly_3)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + education
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2243 191.07 4 51.555 151.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_8)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + jobclass
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2246 235.13 1 7.5022 71.662 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_9)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + health
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2246 234.64 1 7.9858 76.44 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_10)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + health_ins
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2246 218.60 1 24.026 246.85 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_11)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + maritl
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2243 234.59 4 8.0354 19.207 1.506e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_12)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + year
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2246 240.77 1 1.8613 17.363 3.204e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(poly_1, poly_13)
Analysis of Variance Table
Model 1: logwage ~ poly(age, 3)
Model 2: logwage ~ poly(age, 3) + race
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2247 242.63
2 2244 240.26 3 2.3738 7.3904 6.316e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Adding each variable individually to the cubic polynomial of age improved model fit, and adding all variables successively improved fit over the previous model, so all terms are included in the final model.
# Note: this loop computes RMSE on the training data for each polynomial
# degree rather than a true cross-validated error; it serves as a quick check
# of how fit changes with degree.
poly_cv_error <- rep(0, 10)
for (i in 1:10) {
  poly <- lm(logwage ~ poly(age, i) + education + race + jobclass + health + health_ins + maritl + year, data = train_data)
  poly_train_preds <- predict(poly, newdata = train_data)
  poly_cv_error[i] <- sqrt(mean((train_data$logwage - poly_train_preds)^2))
}
plot(1:10, poly_cv_error, pch = 19, type = "b", main = "Polynomial RMSE in Training Data", xlab = "Degree of polynomial", ylab = "RMSE")
grid()
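Since the loop above evaluates each degree on the same training data it was fitted to, the error can only decrease with degree. A genuine k-fold cross-validation holds each fold out in turn; the sketch below illustrates the pattern on the built-in mtcars data so it is self-contained (the report's train_data and formula would be substituted in practice, and all object names here are illustrative).

```r
# Sketch of k-fold cross-validation for polynomial degree selection,
# illustrated with mpg ~ poly(hp, d) on the built-in mtcars data.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels
cv_rmse <- sapply(1:4, function(d) {
  fold_err <- sapply(1:k, function(f) {
    fit  <- lm(mpg ~ poly(hp, d), data = mtcars[folds != f, ])  # fit on k-1 folds
    pred <- predict(fit, newdata = mtcars[folds == f, ])        # predict held-out fold
    sqrt(mean((mtcars$mpg[folds == f] - pred)^2))
  })
  mean(fold_err)  # average held-out RMSE across folds
})
which.min(cv_rmse)  # degree with the lowest cross-validated RMSE
```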
poly_final <- lm(logwage ~ poly(age, 3) + education + race + jobclass + health + health_ins + maritl + year, data = train_data)
poly_1_train_preds <- predict(poly_1, newdata = train_data)
poly_1_train_rmse <- sqrt(mean((train_data$logwage - poly_1_train_preds)^2))
poly_final_train_preds <- predict(poly_final, newdata = train_data)
poly_final_train_rmse <- sqrt(mean((train_data$logwage - poly_final_train_preds)^2))
poly_1_train_rmse
[1] 0.3283105
poly_final_train_rmse
[1] 0.2728182
poly_1_test_preds <- predict(poly_1, newdata = test_data)
poly_1_test_rmse <- sqrt(mean((test_data$logwage - poly_1_test_preds)^2))
poly_final_test_preds <- predict(poly_final, newdata = test_data)
poly_test_rmse <- sqrt(mean((test_data$logwage - poly_final_test_preds)^2))
poly_1_test_rmse
[1] 0.339511
poly_test_rmse
[1] 0.2796641
Confirms that adding the extra variables improved model fit over the cubic age polynomial alone. The very similar training and test values indicate low variance.
summary(poly_final)
Call:
lm(formula = logwage ~ poly(age, 3) + education + race + jobclass +
health + health_ins + maritl + year, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-1.56313 -0.14468 0.00307 0.15782 1.22616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.188773 5.709478 -3.711 0.000211 ***
poly(age, 3)1 1.808014 0.324230 5.576 2.75e-08 ***
poly(age, 3)2 -2.135448 0.292437 -7.302 3.92e-13 ***
poly(age, 3)3 0.760058 0.280012 2.714 0.006691 **
education2. HS Grad 0.088780 0.022020 4.032 5.72e-05 ***
education3. Some College 0.178204 0.023348 7.633 3.39e-14 ***
education4. College Grad 0.277628 0.023725 11.702 < 2e-16 ***
education5. Advanced Degree 0.433694 0.026162 16.577 < 2e-16 ***
race2. Black -0.039153 0.020227 -1.936 0.053031 .
race3. Asian -0.025968 0.025058 -1.036 0.300169
race4. Other -0.064683 0.051601 -1.254 0.210149
jobclass2. Information 0.016901 0.012351 1.368 0.171354
health2. >=Very Good 0.048332 0.013218 3.657 0.000262 ***
health_ins2. No -0.176236 0.013212 -13.340 < 2e-16 ***
maritl2. Married 0.125217 0.017252 7.258 5.40e-13 ***
maritl3. Widowed 0.036124 0.072957 0.495 0.620544
maritl4. Divorced -0.005757 0.027869 -0.207 0.836361
maritl5. Separated 0.066160 0.044323 1.493 0.135659
year 0.012751 0.002846 4.480 7.85e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.274 on 2232 degrees of freedom
Multiple R-squared: 0.3897, Adjusted R-squared: 0.3848
F-statistic: 79.2 on 18 and 2232 DF, p-value: < 2.2e-16
Year is the only variable that makes sense for a step function, as it is the only numerical variable aside from age, and it has natural cut points.
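To see how cut() forms the step-function bins, the snippet below applies the same breaks to the years in the data: each year falls into its own right-closed interval, yielding one step per year.

```r
# cut() with breaks 2002:2009 produces seven right-closed year bins,
# e.g. 2003 falls in (2002,2003], 2004 in (2003,2004], and so on.
year_bins <- cut(2003:2009, breaks = 2002:2009)
levels(year_bins)   # seven interval labels
table(year_bins)    # one year per bin
```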
step_year <- lm(logwage ~ cut(year, breaks = 2002:2009), data = train_data)
year_step_fit <- tibble(
year = seq(2003, 2009, 0.1),
logwage = predict(step_year, tibble(year = seq(2003, 2009, 0.1)))
)
ggplot(train_data, aes(x = year, y = logwage)) +
geom_point() +
geom_line(
data = year_step_fit,
aes(x = year, y = logwage)
) +
scale_x_continuous(breaks = 2003:2009, minor_breaks = NULL)
step_year_train_preds <- predict(step_year, newdata = train_data)
step_year_train_rmse <- sqrt(mean((train_data$logwage - step_year_train_preds)^2))
step_year_test_preds <- predict(step_year, newdata = test_data)
step_year_test_rmse <- sqrt(mean((test_data$logwage - step_year_test_preds)^2))
step_year_train_rmse
[1] 0.3478821
step_year_test_rmse
[1] 0.3577684
A step function of year is a worse predictor than the polynomial fits.
gam_1 <- gam(logwage ~ s(age, 4), data = train_data)
gam_2 <- gam(logwage ~ s(age, 4) + education, data = train_data)
gam_3 <- gam(logwage ~ s(age, 4) + jobclass, data = train_data)
gam_4 <- gam(logwage ~ s(age, 4) + health, data = train_data)
gam_5 <- gam(logwage ~ s(age, 4) + health_ins, data = train_data)
gam_6 <- gam(logwage ~ s(age, 4) + maritl, data = train_data)
gam_7 <- gam(logwage ~ s(age, 4) + year, data = train_data)
gam_8 <- gam(logwage ~ s(age, 4) + race, data = train_data)
anova(gam_1, gam_2)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + education
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2242 190.89 4 51.505 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_3)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + jobclass
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2245 234.91 1 7.4918 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_4)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + health
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.4
2 2245 234.4 1 7.996 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_5)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + health_ins
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2245 218.26 1 24.141 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_6)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + maritl
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.4
2 2242 234.5 4 7.8994 1.543e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_7)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + year
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2245 240.52 1 1.8734 2.894e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gam_1, gam_8)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + race
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2243 240.07 3 2.3268 7.389e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
gam_1 <- gam(logwage ~ s(age, 4), data = train_data)
gam_2 <- gam(logwage ~ s(age, 4) + education, data = train_data)
gam_3 <- gam(logwage ~ s(age, 4) + education + jobclass, data = train_data)
gam_4 <- gam(logwage ~ s(age, 4) + education + jobclass + health, data = train_data)
gam_5 <- gam(logwage ~ s(age, 4) + education + jobclass + health + health_ins, data = train_data)
gam_6 <- gam(logwage ~ s(age, 4) + education + jobclass + health + health_ins + maritl, data = train_data)
gam_7 <- gam(logwage ~ s(age, 4) + education + jobclass + health + health_ins + maritl + year, data = train_data)
gam_8 <- gam(logwage ~ s(age, 4) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
anova(gam_1, gam_2, gam_3, gam_4, gam_5, gam_6, gam_7, gam_8)
Analysis of Deviance Table
Model 1: logwage ~ s(age, 4)
Model 2: logwage ~ s(age, 4) + education
Model 3: logwage ~ s(age, 4) + education + jobclass
Model 4: logwage ~ s(age, 4) + education + jobclass + health
Model 5: logwage ~ s(age, 4) + education + jobclass + health + health_ins
Model 6: logwage ~ s(age, 4) + education + jobclass + health + health_ins +
maritl
Model 7: logwage ~ s(age, 4) + education + jobclass + health + health_ins +
maritl + year
Model 8: logwage ~ s(age, 4) + education + jobclass + health + health_ins +
maritl + year + race
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2246 242.40
2 2242 190.89 4 51.505 < 2.2e-16 ***
3 2241 190.54 1 0.352 0.03031 *
4 2240 188.33 1 2.210 5.703e-08 ***
5 2239 175.25 1 13.074 < 2.2e-16 ***
6 2235 169.30 4 5.954 2.388e-16 ***
7 2234 167.82 1 1.482 8.819e-06 ***
8 2231 167.38 3 0.435 0.12160
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Adding each variable individually to the smoothing spline of age improved model fit, and adding all variables successively improved fit over the previous model, so all terms are included in the final model.
# As above, this computes training RMSE for each choice of df rather than a
# true cross-validated error. gam_fit avoids masking the gam() function.
gam_cv_error <- rep(0, 10)
for (i in 1:10) {
  gam_fit <- gam(logwage ~ s(age, i) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
  gam_train_preds <- predict(gam_fit, newdata = train_data)
  gam_cv_error[i] <- sqrt(mean((train_data$logwage - gam_train_preds)^2))
}
plot(1:10, gam_cv_error, pch = 19, type = "b", main = "GAM RMSE in Training Data", xlab = "Degrees of freedom", ylab = "RMSE")
grid()
gam_final <- gam(logwage ~ s(age, 4) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
gam_final_train_preds <- predict(gam_final, newdata = train_data)
gam_final_train_rmse <- sqrt(mean((train_data$logwage - gam_final_train_preds)^2))
gam_final_train_rmse
[1] 0.2726902
gam_final_test_preds <- predict(gam_final, newdata = test_data)
gam_final_test_rmse <- sqrt(mean((test_data$logwage - gam_final_test_preds)^2))
gam_final_test_rmse
[1] 0.2798356
Very similar performance using smoothing splines in the GAM to the polynomial. The elbow in the CV plot indicates df = 4 is the best compromise.
par(mfrow = c(2, 2))
plot(gam_final, se = T, col = "blue")
# As above, this computes training RMSE for each span rather than a true
# cross-validated error.
gam_lo_cv_error <- rep(0, 10)
for (i in 1:10) {
  gam_lo <- gam(logwage ~ lo(age, span = i/10) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
  gam_lo_train_preds <- predict(gam_lo, newdata = train_data)
  gam_lo_cv_error[i] <- sqrt(mean((train_data$logwage - gam_lo_train_preds)^2))
}
plot(1:10/10, gam_lo_cv_error, pch = 19, type = "b", main = "GAM Local Regression RMSE in Training Data", xlab = "Span", ylab = "RMSE")
grid()
gam_lo_span_final <- gam(logwage ~ lo(age, span = 0.2) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
gam_lo_span_final_train_preds <- predict(gam_lo_span_final, newdata = train_data)
gam_lo_span_final_train_rmse <- sqrt(mean((train_data$logwage - gam_lo_span_final_train_preds)^2))
gam_lo_span_final_train_rmse
[1] 0.272399
gam_lo_span_final_test_preds <- predict(gam_lo_span_final, newdata = test_data)
gam_lo_span_final_test_rmse <- sqrt(mean((test_data$logwage - gam_lo_span_final_test_preds)^2))
gam_lo_span_final_test_rmse
[1] 0.2801312
par(mfrow = c(2, 2))
plot(gam_lo_span_final, se = T, col = "red")
gam_lo_final <- gam(logwage ~ lo(age, span = 0.5) + education + jobclass + health + health_ins + maritl + year + race, data = train_data)
gam_lo_final_train_preds <- predict(gam_lo_final, newdata = train_data)
gam_lo_final_train_rmse <- sqrt(mean((train_data$logwage - gam_lo_final_train_preds)^2))
gam_lo_final_train_rmse
[1] 0.272861
gam_lo_final_test_preds <- predict(gam_lo_final, newdata = test_data)
gam_lo_final_test_rmse <- sqrt(mean((test_data$logwage - gam_lo_final_test_preds)^2))
gam_lo_final_test_rmse
[1] 0.279907
par(mfrow = c(2, 2))
plot(gam_lo_final, se = T, col = "blue")
A span of 0.5 provides a good balance of RMSE and smoothness (there is a rough “elbow” after this point).